Throughout college, every Tuesday night I could, I would go to a bar in downtown Gainesville with a group of friends to play trivia. One night at trivia, we started talking about what makes a good trivia team. We decided that certain topics always pop up in trivia, and that a good team has one or two people who know a lot about each of them.
As you can imagine, immediately after having this conversation we began arguing about which player was the most important. So I began thinking...
"What topics come up in trivia the most?"¶
As a trivia nerd, I've spent more than my fair share of time watching Jeopardy!. So when I found a dataset of all the questions and answers used across 15 years of the show's history, I knew what I had to do.
First things first, we need to bring the CSV in and clean it up a bit: standardize the column names, strip out the punctuation, and drop the pieces that aren't relevant to the project.
import pandas as pd
import matplotlib.pyplot as plt

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)

# Give the columns clean, consistent names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
                    'Question', 'Answer']
jeopardy.columns
Now that the columns are squared away, I'm going to strip out the punctuation to make things easier down the road.
import re
import nltk

# One-time downloads for the NLTK pieces we'll use below
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stopword lists

def removePunct(word):
    # Strip everything except letters, digits, and whitespace, then lowercase
    word = re.sub(r'[^A-Za-z0-9\s]', '', word)
    return word.lower()

jeopardy['clean_question'] = jeopardy['Question'].apply(removePunct)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(removePunct)
jeopardy.head(5)

# Keep only the two columns we need; .copy() sidesteps pandas'
# SettingWithCopyWarning when we add columns to this frame later
new = jeopardy[['clean_question', 'clean_answer']].copy()
new.head()
Now that everything is a bit cleaner, let's get to the NLP work. First we'll tokenize the strings and remove stop words to help with efficiency. I played around with stemming and lemmatizing the data, but decided it ended up hurting the end result more than it helped (there's a quick sketch of what that experiment looked like after the next code block).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stopword set once up front; calling stopwords.words('english')
# inside the loop would redo that lookup for every single token
stop_words = set(stopwords.words('english'))

def removeStop(para):
    words = word_tokenize(para)
    useful_words = [w for w in words if w not in stop_words]
    return ' '.join(useful_words)

new['final_question'] = new['clean_question'].apply(removeStop)
new['final_answer'] = new['clean_answer'].apply(removeStop)
new.head()
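For the curious, here's roughly what the stemming/lemmatizing experiment looked like. This is a sketch rather than the exact code I ran, using NLTK's PorterStemmer and WordNetLemmatizer (the wordnet corpus needs a one-time download):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stemWords(para):
    # Chop each token down to its stem, e.g. 'painting' -> 'paint'
    return ' '.join(stemmer.stem(w) for w in para.split())

def lemmatizeWords(para):
    # Map each token to its dictionary form, e.g. 'cities' -> 'city'
    return ' '.join(lemmatizer.lemmatize(w) for w in para.split())

Swapping either of these into the pipeline collapses related word forms into a single token, which is what ended up doing more harm than good for the phrase counts here.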
With the text prepped, we can run the questions through spaCy, pull out the noun chunks, and count up the most common ones.

import spacy
from collections import Counter

# Load spaCy's small English pipeline (older spaCy versions spelled this
# spacy.load('en'); current ones want the full model name)
nlp = spacy.load('en_core_web_sm')

all_ners = []
for ex in new['final_question']:
    doc = nlp(ex)
    # Grab every noun chunk, e.g. 'world war ii' or 'new york city'
    for chunk in doc.noun_chunks:
        all_ners.append(chunk.text)

len(all_ners)

c = Counter(all_ners)
c = c.most_common(200)
There's a bunch of clutter in there that got past the stopword wall the first time around. I'm just going to knock the obvious offenders off the top of the list, and then we'll run the same thing over the answers.
# Drop the junk entries that floated to the top, then keep the counts as a dict
c = c[9:]
c = dict(c)

# Same noun-chunk extraction, this time over the answers
all_ners = []
for ex in new['final_answer']:
    doc = nlp(ex)
    for chunk in doc.noun_chunks:
        all_ners.append(chunk.text)

c2 = Counter(all_ners)
c2 = c2.most_common(200)
c2 = dict(c2)
Now we can turn those frequency counts into word clouds: one for the questions, one for the answers.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plotCloud(freqs):
    # Render a frequency dict as a word cloud
    wordcloud = WordCloud(width=1000, height=500).generate_from_frequencies(freqs)
    plt.figure(figsize=(15, 8))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
    plt.close()

plotCloud(c)   # most common noun chunks in the questions
plotCloud(c2)  # most common noun chunks in the answers
And that wraps up this analysis! There are plenty of other routes to go down with this dataset that could lead to some interesting findings. With the results of our noun-chunk extraction, we know that to have the best shot at Jeopardy!, you should focus on the topics that dominate the two word clouds above.
Of course, this approach is susceptible to problems like anything else. The analysis picks up the phrases that appear most often, not necessarily the topics that appear most often. We could use a knowledge base, a proper named-entity recognizer, or some other way of tracing phrases back to larger topics to build a better list of subjects.
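As one hedged sketch of that direction (not something from the analysis above), spaCy's built-in named-entity recognizer tags each entity with a coarse label like PERSON, GPE, or EVENT, which gives a rougher but more topic-like grouping than raw noun chunks. Note that NER leans on capitalization, so it works better on the original cased questions than on the cleaned lowercase ones:

from collections import Counter

# Reuses the `nlp` pipeline and `jeopardy` frame from earlier; nlp.pipe
# streams the documents through spaCy in batches, which is much faster
# than calling nlp() one question at a time
entity_counts = Counter()
for doc in nlp.pipe(jeopardy['Question'].astype(str)):
    for ent in doc.ents:
        # e.g. ('PERSON', 'Shakespeare') or ('GPE', 'France')
        entity_counts[(ent.label_, ent.text)] += 1

entity_counts.most_common(20)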
Let me know if you have any comments, questions or thoughts!
allison.kahn.12@gmail.com